Clustering In cluster analysis (group analysis), a distinction is made between supervised
clustering (groups known) and unsupervised clustering (groups unknown). An example of
supervised clustering is the k-nearest-neighbour algorithm, in which new data points (e.g.
patients) are assigned to predefined groups: the k nearest neighbours are considered (e.g.
for k = 3, the three nearest neighbours), and the new point is assigned to the cluster that
dominates among them. This makes it possible, for example, to assign a diseased person to
an optimal therapy (e.g. radiation, chemotherapy) according to their gene expression
profile. If, on the other hand, one wants to find previously unknown clusters in one’s data,
one can apply unsupervised clustering, such as k-means (non-hierarchical; the underlying
optimization problem is NP-hard) or complete linkage (hierarchical).
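As a minimal sketch of the k = 3 rule described above (with invented toy data standing in for gene expression profiles, not data from the source), the majority vote among the nearest neighbours could look like this in plain Python:

```python
import math
from collections import Counter

def knn_classify(new_point, data, labels, k=3):
    """Assign new_point to the majority class among its k nearest neighbours."""
    dists = sorted((math.dist(new_point, p), lab) for p, lab in zip(data, labels))
    top_k = [lab for _, lab in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Hypothetical patients described by two expression values, in two therapy groups
data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (3.0, 3.0), (3.1, 2.9), (2.9, 3.2)]
labels = ["chemo", "chemo", "chemo", "radiation", "radiation", "radiation"]

print(knn_classify((1.1, 1.0), data, labels, k=3))  # → chemo
```

A new patient near the first group is voted into "chemo" by all three of its nearest neighbours; ties can occur for even k, which is why odd values such as k = 3 are popular.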
Regression Regression analyses examine the relationship between a dependent variable
(regressand, criterion, “response variable”) and an independent variable (predictor,
“predictor variable”). For example, a linear regression can be used to examine the
relationship between weight (independent variable) and blood pressure (dependent
variable). The prerequisite is that the dependent and independent variables are metric. The
calculation is done with the least-squares estimator, which minimizes the sum of the
squared residuals (the distances from the data points to the regression line) to obtain the
best fit to the data (a regression line is fitted through the data). How well the regression
model represents the data (goodness of fit) is usually assessed with a t-test (p-value < 0.05)
and the R2 (coefficient of determination, between 0 [no linear correlation] and 1 [perfect
linear correlation]).
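The least-squares fit and the R2 can be computed directly from the textbook formulas. A small sketch with invented weight/blood-pressure values (not data from the source):

```python
def linear_fit(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot

# Hypothetical weight (kg) vs. systolic blood pressure (mmHg)
weight = [60, 70, 80, 90, 100]
bp = [115, 121, 128, 134, 142]
a, b, r2 = linear_fit(weight, bp)
print(f"bp ≈ {a:.1f} + {b:.2f} * weight, R2 = {r2:.3f}")
```

Here R2 comes out close to 1, i.e. almost all of the variance in blood pressure is explained by the fitted line.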
If, on the other hand, the dependent variable is binary/dichotomous (yes/no), logistic
regression can be used. A popular analysis question here is: What is the probability of
developing high blood pressure (or heart failure) if you are overweight? The calculation is
done using the logit function [log(p/(1 − p))] and the maximum-likelihood method (a
sigmoidal curve is fitted to the data). To assess the model quality, one uses, for example, a
chi-square test (p-value < 0.05), the R2 and the AIC (Akaike Information Criterion, which
penalizes the model fit for the number of parameters used).
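The logit maps a probability to log-odds, and its inverse (the sigmoid) maps the linear predictor back to a probability. A minimal sketch, using hypothetical coefficients (a = −10, b = 0.35 per BMI unit are invented for illustration, not estimates from the source):

```python
import math

def logit(p):
    """Log-odds: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit; maps any real z back to a probability."""
    return 1 / (1 + math.exp(-z))

# Suppose a fitted model: logit(P(hypertension)) = -10 + 0.35 * BMI
a, b = -10.0, 0.35
for bmi in (22, 28, 34):
    p = sigmoid(a + b * bmi)
    print(f"BMI {bmi}: P(hypertension) = {p:.2f}")  # 0.09, 0.45, 0.87
```

This is the “sigmoidal curve in the data”: the predicted probability rises smoothly from near 0 to near 1 as BMI increases.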
Often one also has time points (time-to-event data) for a dependent variable, for example
in the context of a follow-up study. Cox regression is used for this (survival time analysis,
“time-to-event” analysis). Survival time analyses are of interest if one wants to know, for
example, what the influence of a mutation or a therapy is on the 5-year survival. The
calculation uses the Kaplan-Meier estimator (the hazard function describes the risk [failure
rate] that the event actually occurs at a given time; censored data [no exact information
about the event time] contribute to the calculation only up to the time of censoring). The
survival rates are represented in a Kaplan-Meier curve, and the model quality is assessed
using a log-rank test and the Cox proportional-hazards model.
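The Kaplan-Meier estimate multiplies, at each event time, the fraction of at-risk patients who survive that time. A small sketch with invented follow-up data (times in months, event = 1, censored = 0; not data from the source):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time per patient
    events: 1 = event occurred, 0 = censored
    Censored patients leave the risk set at their censoring time
    but are never counted as events.
    """
    at_risk = len(times)
    s = 1.0
    curve = []
    for t in sorted(set(times)):
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        if d:
            s *= 1 - d / at_risk      # survive this event time
            curve.append((t, s))
        at_risk -= sum(1 for ti in times if ti == t)
    return curve

times  = [6, 12, 12, 18, 24, 30, 36]
events = [1,  1,  0,  1,  0,  1,  0]
print(kaplan_meier(times, events))
```

Plotting these (time, survival) pairs as a step function gives the familiar Kaplan-Meier curve; two such curves (e.g. therapy vs. control) are then compared with the log-rank test.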
A nice overview of regression analysis is provided by Worster et al. (2007), Schneider
et al. (2010), Singh and Mukhopadhyay (2011) and Zwiener et al. (2011). The two recent
papers on remdesivir (Wang et al. 2020) and lopinavir-ritonavir (Cao et al. 2020) treatment
in COVID-19 should also be mentioned here.
Logistic regression and Cox regression are popular for the analysis of diagnostic and
prognostic signatures, i.e. the optimal combination of genes (Vey et al. 2019) or